\section{Evaluation}\label{sec:eval}
%\subsection{Goals}\label{sec:goals}
We set out to evaluate how computer-generated responses compare to human responses in their perceived {\em human-likeness} and {\em relevance}. In particular, we compare different system variants to investigate what makes responses seem more human-like or relevant.

\subsection{Materials}\label{sec:material}
Our empirical evaluation concentrated on topics related to mobile telephones, specifically   Apple's iPhone and devices based on the Android operating system. %Using an RSS-feed data-retrieval procedure,
We collected 300 articles from this domain on which to train the topic models, settling on 10 topics. %that were most prominent in the topic model distributions
Next, we generated a set of user agendas referring to the same 10 topics. Each agenda is represented by a single keyword from a topic model distribution  and a sentiment value 
%that can be positive or negative,  as well as strong, medium or neutral: 
$sentiment_t\in\{-8, -4, 0, 4, 8\}$.
Finally, we selected 10 new articles and generated a pool of 1000 responses for each, comprising 100 unique responses for each combination of $sentiment_t$ and system variant (i.e., with or without a knowledge base). Table~\ref{tab:sample_generation} presents an example response for each such combination. In addition, we randomly sampled 5 to 10 genuine, short- to medium-length online human responses to each article.
% In our experimental setup, each document is represented by its conclusion section, and is assigned a random agenda. The document+agenda are assigned human responses and computer responses, which are then evaluated as human-like or computer0like by human readers,
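To make the resulting design concrete, the sketch below shows the agenda and response-pool bookkeeping in {\tt R}; the keyword list and all variable names are illustrative, and only the sentiment levels and the 100-responses-per-cell figure are taken from the description above.
\begin{small}
\begin{verbatim}
# Illustrative agenda grid: one keyword per topic
# (keywords here are made up), crossed with the
# five sentiment levels.
sentiments <- c(-8, -4, 0, 4, 8)
keywords   <- c("iphone", "android", "galaxy",
                "nokia", "ios", "apps", "battery",
                "screen", "camera", "price")
agendas <- expand.grid(keyword = keywords,
                       sentiment = sentiments)

# Response pool per article: 100 unique responses
# for each sentiment x variant combination.
variants <- c("base", "kb")
100 * length(sentiments) * length(variants)  # 1000
\end{verbatim}
\end{small}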

\begin{table*}[t]
\center
\scalebox{0.9}{
\begin{tabular}{|r|l|p{15.4cm}|}
\hline
 Sent.                & KB    & Response  \\ \hline
\multirow{2}{*}{$-$8} & No    & Android is horrendous so I think that the writer is completely correct!!!  \\ %\hline
                      & Yes   & Apple is horrendous so I feel that the author is not really right!!! iOS is horrendous as well.  \\ \hline
\multirow{2}{*}{$-$4} & No    & I think that the writer is mistaken because apple actually is unexceptional.  \\ %\hline
                      & Yes   & I think that the author is wrong because Nokia is mediocre. Apple on the other hand is pretty good ...  \\ \hline
\multirow{2}{*}{0}    & No     & The text is accurate. Apple is okay.  \\ %\hline
                      & Yes    & Galaxy is okay so I think that the content is accurate. All-in-all samsung makes fantastic gadgets.  \\ \hline
\multirow{2}{*}{4}    & No     & Android is pretty good so I feel that the author is right.  \\ %\hline
                      & Yes    & Nokia is nice. The article is precise. Samsung on the other hand is fabulous...  \\ \hline
\multirow{2}{*}{8}    & No     & Galaxy is great!!! The text is completely precise.  \\ %\hline
                      & Yes    & Galaxy is awesome!!! The author is not completely correct. In fact I think that samsung makes awesome products.  \\ \hline
 \end{tabular}}
\caption{Examples of responses generated by the system with or without knowledge base (KB) and with different sentiment levels.}\label{tab:sample_generation}
\end{table*}

\subsection{Surveys}\label{sec:surveys}
We collected evaluation data via two online surveys on Amazon Mechanical Turk (\url{www.mturk.com}). In Survey~1, participants judged whether responses to articles were written by a human or a computer, akin to (a simplified version of) the Turing test \cite{Turing1950}. In Survey~2, responses were rated on their relevance to the article, in effect testing whether they abide by the Gricean Maxim of Relation (`be relevant'). This is comparable to the study by \newcite{Ritter:2011:DRG:2145432.2145500}, in which people judged which of two responses was `best'.

%We conducted two online surveys, one for evaluating human-likeness, and one for evaluating response relevance.

Each survey comprised 10 randomly ordered trials, corresponding to the 10 selected articles. First, the participant was presented with a snippet from the article. When the participant clicked a button, the text was removed and its presentation duration recorded.
%Next, two multiple-choice comprehension questions appeared, inquiring about the snippet's topic and sentiment, respectively. Data on a trial was discarded from analysis if the topic question was answered incorrectly\footnote{We ignored replies to the sentiment question because a snippet's true sentiment was not always unambiguously clear.}
Next, a multiple-choice question asked about the snippet's topic. Data from a trial were discarded from the analysis if the participant answered this question incorrectly or if the snippet had been displayed for less than 10~msec per character; we took these to be cases where the snippet was not properly read.
Finally, the participant was shown a randomly ordered list of responses to the article.
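A minimal sketch of this exclusion criterion, assuming a data frame with one row per trial and hypothetical columns {\tt topic\_correct}, {\tt duration\_ms} and {\tt snippet\_chars}:
\begin{small}
\begin{verbatim}
# Keep a trial only if the topic question was
# answered correctly and the snippet was shown
# for at least 10 ms per character.
keep <- with(trials,
  topic_correct &
  duration_ms / snippet_chars >= 10)
trials_clean <- trials[keep, ]
\end{verbatim}
\end{small}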

In Survey~1, four responses were presented for each article: three randomly selected from the pool of human responses to that article and one generated by our system. The task was to categorize each response on a 7-point scale with the labels `Certainly human/computer', `Probably human/computer', `Maybe human/computer' and `Unsure'.
In Survey~2, five responses were presented: three human responses and two computer-generated ones. The task was to rate the responses' relevance on a 7-point scale labeled `Completely (not) relevant', `Mostly (not) relevant', `Somewhat (not) relevant', and `Unsure'.
As a control condition, one of the human responses and one of the computer responses were in fact taken from a different article than the one just presented. In both surveys, the computer-generated responses presented to each participant were balanced across sentiment levels and generation functions ($g_{\rm base}$ and $g_{\rm kb}$).
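For reference in the analyses below, the two rating scales can be coded numerically as follows. The label wording is that of the surveys; the direction of the computer-likeness coding (1 = `Certainly human' to 7 = `Certainly computer') matches Table~\ref{tab:Surv1_meanCI}, whereas the analogous direction for relevance (higher = more relevant) is the convention we assume.
\begin{small}
\begin{verbatim}
humanlikeness <- c(
  "Certainly human"     = 1, "Probably human"    = 2,
  "Maybe human"         = 3, "Unsure"            = 4,
  "Maybe computer"      = 5, "Probably computer" = 6,
  "Certainly computer"  = 7)
relevance <- c(
  "Completely not relevant" = 1,
  "Mostly not relevant"     = 2,
  "Somewhat not relevant"   = 3,
  "Unsure"                  = 4,
  "Somewhat relevant"       = 5,
  "Mostly relevant"         = 6,
  "Completely relevant"     = 7)
\end{verbatim}
\end{small}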

After completing the 10 trials, participants provided basic demographic information, including their native language. Data from non-native English speakers were discarded. Surveys~1 and~2 were completed by 62 and 60 native English speakers, respectively. We provide these data as supplementary materials.


\subsection{Analysis and Results}\label{sec:results}

%\paragraph{Survey~1:  Human- vs.\ Computer-Likeness.}
\paragraph{Survey~1:  Computer-Likeness Rating.}
Table~\ref{tab:Surv1_meanCI} shows the mean `computer-likeness'-ratings from 1 (`Certainly human') to 7 (`Certainly computer') for each response category. Clearly,  the human responses are rated as more human-like than the computer-generated ones: Our model did not generally  mislead the participants.
This may be due to the template-based response structure: Over the course of the survey,  human raters are likely to notice this structure and infer that such responses are computer-generated.

\begin{table*}
\parbox{0.47\linewidth}{
\centering
\scalebox{0.85}{
\begin{tabular}{|lr|}
\hline
Response Type    & Mean $\pm$ 95\% CI \\ \hline
Human            & 3.33 $\pm$ 0.08 \\
Computer (all)   & 4.49 $\pm$ 0.15 \\
Computer ($-$KB) & 4.66 $\pm$ 0.20 \\
Computer ($+$KB) & 4.32 $\pm$ 0.22 \\
\hline
\end{tabular}}
\caption{Mean and 95\% confidence interval of computer-likeness rating  per response category.}\label{tab:Surv1_meanCI}
}
\hfill
\parbox{0.47\linewidth}{
\centering
\scalebox{0.85}{
\begin{tabular}{|lrrc|}
\hline
Factor                      & $b$   & $t$  & $P(b<0)$ \\ \hline
(intercept)                 & 3.590 &      &  \\
{\sc is\_comp}              & 0.193 & 2.11 & 0.015 \\
{\sc pos}                   & 0.069 & 4.76 & 0.000 \\
{\sc is\_comp $\times$ pos} & 0.085 & 6.27 & 0.000  \\
\hline
\end{tabular}}
\caption{Computer-likeness rating regression results, comparing human to computer responses.} \label{tab:Surv1_regr_hum_comp}
}
\end{table*}

To investigate whether such learning indeed occurs, we fitted a linear mixed-effects model with predictor variables {\sc is\_comp} ($+$1: computer-generated, $-$1: human), {\sc pos} (position of the trial in the survey, 0 to 9), and the interaction between the two.
Table~\ref{tab:Surv1_regr_hum_comp} presents, for each factor in the regression analysis, the coefficient $b$ and its $t$-statistic. The coefficient equals the increase in computer-likeness rating for each unit increase in the predictor variable. The $t$-statistic is indicative of how much variance in the ratings is accounted for by the predictor.  We also obtained a probability distribution over each coefficient by Markov Chain Monte Carlo sampling using the {\tt R} package {\tt lme4} version 0.99 \cite{Bates2005}. From each coefficient's distribution, we estimate the posterior probability that $b$ is negative, which quantifies the reliability of the effect.
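For concreteness, a sketch of this analysis in {\tt R} is given below. The fixed-effects formula follows the description above; the random-effects structure (by-participant and by-article intercepts), the data-frame and column names, and the number of posterior samples are our assumptions rather than details of the original analysis.
\begin{small}
\begin{verbatim}
library(lme4)  # the 0.99x release used here

# Fixed effects as described in the text; random
# intercepts for participants and articles are an
# assumed grouping structure.
m1 <- lmer(rating ~ is_comp * pos +
             (1 | participant) + (1 | article),
           data = survey1)
summary(m1)

# Posterior samples of the coefficients; mcmcsamp()
# is available in lme4 0.99x (it has been removed
# from later releases). P(b < 0) is the proportion
# of a coefficient's samples that fall below zero.
post <- mcmcsamp(m1, n = 10000)
\end{verbatim}
\end{small}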

The positive $b$ value for {\sc pos} shows that, overall, ratings drift towards the `computer' end of the scale.
More importantly,   a positive interaction with {\sc is\_comp} indicates that the difference between human and computer responses becomes more noticeable as the survey progresses --- the participants  did learn to identify  computer-generated responses. %This effect is highly reliable, as evidenced by the near-zero probability that the interaction is in fact in the negative direction.
However, the positive coefficient for {\sc is\_comp} means that even at the very first trial, computer responses are considered to be more computer-like than human responses.

\paragraph{Factors Affecting Human-Likeness.}
Our finding that the identifiability of computer-generated responses cannot be fully attributed to their repetitiveness raises the question: what makes a response more human-like? The results provide several insights into this matter. First,
the mean scores in Table~\ref{tab:Surv1_meanCI} suggest that including a knowledge base increases the responses' human-likeness.
To investigate this further, we performed a separate regression analysis using only the data on computer-generated responses. This analysis also included the predictors {\sc kb} ($+$1: knowledge base included, $-$1: not included), {\sc sent} ($sentiment_t$, from $-8$ to $+8$), the absolute value of {\sc sent}, and the interaction between {\sc kb} and {\sc pos}.
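Continuing the sketch given for the previous analysis (with the same assumed variable names and grouping structure), the corresponding model could be specified as:
\begin{small}
\begin{verbatim}
# Computer-generated responses only; abs(sent) can
# be used directly inside the formula.
m1kb <- lmer(rating ~ kb * pos + sent + abs(sent) +
               (1 | participant) + (1 | article),
             data = subset(survey1, is_comp == 1))
\end{verbatim}
\end{small}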

\begin{table*}
\parbox{0.47\linewidth}{
\centering
\scalebox{0.85}{
\begin{tabular}{|lrrc|}
\hline
Factor                    & $b$       & $t$     & $P(b<0)$ \\ \hline
(intercept)               &    4.022  &         & \\
{\sc kb}                  & $-$0.240  & $-$2.13 & 0.987 \\
{\sc pos}                 &    0.144	&    5.82 & 0.000 \\
{\sc sent}                &    0.035  &    2.98 & 0.002 \\
abs({\sc sent})           & $-$0.041  & $-$1.97 & 0.967 \\
{\sc kb $\times$ \sc pos} &    0.023  &    1.03 & 0.121 \\
\hline
\end{tabular}}
\caption{Computer-likeness rating regression results, comparing systems with and without KB.} \label{tab:Surv1_regr_KB}
}
\hfill
\parbox{0.47\linewidth}{
\centering
\scalebox{0.85}{
\begin{tabular}{|llr|}
\hline
Response Type                     & Source & Mean $\pm$ 95\% CI \\ \hline
\multirow{2}{*}{Human}            & this   & 4.85 $\pm$ 0.11 \\
                                  & other  & 3.56 $\pm$ 0.18 \\
\multirow{2}{*}{Computer (all)}   & this   & 4.52 $\pm$ 0.16 \\
                                  & other  & 2.52 $\pm$ 0.15 \\
\multirow{2}{*}{Computer ($-$KB)} & this   & 4.53 $\pm$ 0.23 \\
                                  & other  & 2.46 $\pm$ 0.21 \\
\multirow{2}{*}{Computer ($+$KB)} & this   & 4.51 $\pm$ 0.23 \\
                                  & other  & 2.58 $\pm$ 0.22 \\
\hline
\end{tabular}}
\caption{Mean and 95\% confidence interval of relevance rating per response category. `Source' indicates whether the response is from the presented text snippet or a random other snippet.}\label{tab:Surv2_meanCI}
}
\end{table*}

\begin{table*}
\parbox{0.47\linewidth}{
\centering
\scalebox{0.85}{
\begin{tabular}{|lrrc|}
\hline
Factor                           & $b$      &  $t$    & $P(b<0)$ \\ \hline
(intercept)                      &    3.861 &         & \\
{\sc is\_comp}                   & $-$0.339 & $-$7.10 & 1.000 \\
{\sc source}                     &    0.824 &   16.80 & 0.000 \\
{\sc is\_comp $\times$ source}   &    0.179 &    5.03 & 0.000 \\
\hline
\end{tabular}}
\caption{Relevance ratings regression results, comparing human to computer responses.} \label{tab:Surv2_regr_hum_comp}
}
\hfill
\parbox{0.47\linewidth}{
\centering
\scalebox{0.85}{
\begin{tabular}{|lrrc|}
\hline
Factor                       & $b$      & $t$     & $P(b<0)$ \\ \hline
(intercept)                  &    3.603 &         & \\
{\sc kb}                     &    0.026 &    0.49 & 0.322 \\
{\sc source}                 &    1.003 &   15.90 & 0.000 \\
{\sc sent}                   &    0.023 &    1.94 & 0.029 \\
abs({\sc sent})              & $-$0.017 & $-$0.93 & 0.819 \\
{\sc kb $\times$ \sc source} & $-$0.032 & $-$0.61 & 0.731 \\
\hline
\end{tabular}}
\caption{Relevance ratings regression results, comparing systems with and without KB.} \label{tab:Surv2_regr_KB}
}
\vspace{-0.15in}
\end{table*}

As can be seen in Table~\ref{tab:Surv1_regr_KB}, there is no reliable interaction between {\sc kb} and {\sc pos}:
the effect of including the {\sc kb} remained constant over the course of the survey. Furthermore, we see evidence that responses with a more positive sentiment are considered more computer-like. The (only weakly reliable) negative effect of the absolute sentiment value suggests that more extreme sentiments are considered more human-like. Apparently, people expect computer responses to be mildly positive and human responses to be more extreme, extremely negative ones in particular.

\paragraph{Survey~2: Relevance Rating.}
The mean relevance scores in Table~\ref{tab:Surv2_meanCI} reveal that a response is rated as more relevant to a snippet if it was actually written in response to that snippet rather than to another one. This reinforces the design choice of including response elements that refer specifically to the article's topic and the author's sentiment. However, human responses are considered more relevant than the computer-generated ones.
This is confirmed by a reliably negative regression coefficient for {\sc is\_comp} (see regression results in Table~\ref{tab:Surv2_regr_hum_comp}).

The analysis included the binary factor {\sc source} ($+$1 if the response came from the presented snippet, $-$1 if it came from a random other article). There is a positive interaction between {\sc source} and {\sc is\_comp}, indicating that presenting a response from a random article is more detrimental to the relevance of computer-generated responses than to that of human responses. This is not surprising, as the computer-generated responses (unlike the human responses) always explicitly refer to the article's topic.
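The Survey~2 analysis can be sketched along the same lines as the Survey~1 model above (again, the variable names and random-effects structure are our assumptions):
\begin{small}
\begin{verbatim}
m2 <- lmer(relevance ~ is_comp * source +
             (1 | participant) + (1 | article),
           data = survey2)
\end{verbatim}
\end{small}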

When analyzing only the data on computer-generated responses, and including predictors for agenda sentiment and for the presence of the knowledge base, we see that including the KB does not affect response relevance (see Table~\ref{tab:Surv2_regr_KB}). Also, there is no interaction between {\sc kb} and {\sc source}: the effect of presenting a response from a different article does not differ between the models with and without the knowledge base. Possibly, responses are considered more relevant if they have a more positive sentiment, but the evidence for this is fairly weak.



